CHAPTER 8 Getting Your Data into the Computer
Collecting categorical data in your research database
Setting up your data collection forms and database tables for categorical data
requires more thought than you may expect. You may assume you already know
how to record and enter categorical data. You just type in the values — such as
“United States,” “nurse,” or “Stage I” — right? Wrong! (But wouldn’t it be nice
if it were that simple?) The following sections look at some of the issues you have
to address when storing categorical values as research data.
Carefully coding categories
The first decision you need to make is how to code the categories. How are you going
to store the values in the research database? Do you want to enter the type of care
provider as nurse, physician, or social worker; or as N, P, or SW; or as 1 = nurse, 2 =
physician, and 3 = social worker; or in some other manner? Most modern statistical
software can analyze categorical data with any of these representations, but it is
easiest for the analyst if you code the variables using numbers to represent the
categories. Software like SPSS, SAS, and R lets you attach a text label to each
numerical code (for example, labeling 1 so that it displays as Nurse on statistical
output), so you can store categories as numbers while still showing what each code
means. In general, the best practice is to set coding conventions, apply them
consistently, and document the content and meaning of each variable. You can also
attach variable labels, which describe the variable itself rather than its values.
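To make the idea of storing numbers while displaying labels concrete, here is a minimal sketch in Python. It is only an illustration of the principle, not how SPSS, SAS, or R implement value labels. The names CARE_PROVIDER_LABELS and labeled_counts are hypothetical, and the six stored values are made up for the example.

```python
from collections import Counter

# Hypothetical value labels, using the codes from the example in the text.
CARE_PROVIDER_LABELS = {1: "Nurse", 2: "Physician", 3: "Social worker"}

# Made-up stored data: one numerical code per participant.
care_provider = [1, 3, 2, 1, 1, 2]


def labeled_counts(codes, labels):
    """Count each stored code and report it under its text label."""
    counts = Counter(codes)
    return {labels[code]: counts.get(code, 0) for code in sorted(labels)}


table = labeled_counts(care_provider, CARE_PROVIDER_LABELS)
for label, n in table.items():
    print(f"{label}: {n}")
```

The data stay numerical throughout, and the labels are applied only when the results are displayed.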
Nothing is worse than having to deal with a data set in which a categorical variable
has been stored with numerical codes, but there is no key to the codes and the
person who created the data set is no longer available. This is why maintaining a
data dictionary — described later in this chapter in “Creating a File that Describes
Your Data File” — is a critical step for ensuring you analyze your research data
properly.
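One way to catch codes that have no key is to compare the stored values against the data dictionary before you analyze anything. The sketch below assumes a very simple dictionary layout; DATA_DICTIONARY, undocumented_codes, and the stored values are all hypothetical, and your own data dictionary may be organized differently.

```python
# Hypothetical data dictionary: one entry per variable, each listing the
# meaning of every allowed code.
DATA_DICTIONARY = {
    "care_provider": {
        "description": "Type of care provider",
        "codes": {1: "nurse", 2: "physician", 3: "social worker"},
    },
}


def undocumented_codes(variable, values, dictionary):
    """Return the stored codes that have no entry in the data dictionary."""
    known = dictionary[variable]["codes"]
    return sorted({value for value in values if value not in known})


# Made-up stored values; 9 and 4 are not defined in the dictionary above.
stored = [1, 2, 3, 9, 2, 4]
missing = undocumented_codes("care_provider", stored, DATA_DICTIONARY)
print(missing)  # [4, 9]
```

Any code that shows up in this list needs to be traced back to its source while someone who knows the data is still available.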
Microsoft Excel doesn’t care whether you type a word or a number in a cell, which
can create problems when storing data. You can enter Type of Caregiver as N for the
first subject, nurse for the second, NURSE for the third, 1 for the fourth, and Nurse
for the fifth, and Excel won’t stop you or throw up an error. Statistical programs
like R would treat each of these entries as a separate, unique category. Even
worse, you may inadvertently add a blank space in the cell before or after the text,
which will be treated as yet another category. Details such as the case of character
values (whether letters are uppercase or lowercase) can also affect queries, because
many comparisons are case-sensitive. In Excel, avoid using autocomplete, and enter
all levels of categorical variables as numerical codes (which can be decoded using
your data dictionary).
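If you inherit a column that was typed inconsistently, a short cleaning step can show how many accidental categories it contains and map them back to one code. The following is a rough sketch in Python rather than a complete cleaning routine. SPELLING_TO_CODE and to_code are hypothetical names, and the list of spellings is an assumption you would replace with whatever actually appears in your own data.

```python
# The entries from the example in the text, plus one with a leading blank space.
raw_entries = ["N", "nurse", "NURSE", "1", "Nurse", " Nurse"]

# Compared as plain text, every spelling counts as its own category.
distinct_raw = len(set(raw_entries))
print(distinct_raw)  # 6

# Hypothetical lookup from cleaned spellings to the numerical code.
SPELLING_TO_CODE = {
    "n": 1, "nurse": 1, "1": 1,
    "p": 2, "physician": 2, "2": 2,
    "sw": 3, "social worker": 3, "3": 3,
}


def to_code(entry):
    """Trim blank spaces, ignore case, and look up the numerical code.

    Returns None for an unrecognized entry so that it can be reviewed by hand.
    """
    key = str(entry).strip().lower()
    return SPELLING_TO_CODE.get(key)


codes = [to_code(entry) for entry in raw_entries]
print(codes)  # [1, 1, 1, 1, 1, 1]
```

Returning None for anything unrecognized, instead of guessing, keeps a typo from being quietly assigned to the wrong category.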